@JonathanC-ARM
Contributor

Initial prototype for FP16 iGEMM support for SME2,
continuing work from #8687.

gmiodice and others added 4 commits October 20, 2025 09:24
- Initial prototype to enable fp16 iGEMM with SME2 in conv2d

Signed-off-by: Gian Marco Iodice <[email protected]>
Signed-off-by: Gian Marco Iodice <[email protected]>
Signed-off-by: Jonathan Clohessy <[email protected]>
Signed-off-by: Jonathan Clohessy <[email protected]>
@dsharlet
Collaborator

This isn't building for us:

test/gemm-microkernel-tester.cc:2455:40: error: no viable overloaded '+='
 2455 |         c_ref[m_index * n() + n_index] +=
      |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
 2456 |             xnn_float16_to_float(input_f16[m_index * k() + k_index]) *
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 2457 |             xnn_float16_to_float(weights[n_index * k() + k_index]);
      |             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
test/gemm-microkernel-tester.cc:2459:38: error: no viable overloaded '+='
 2459 |       c_ref[m_index * n() + n_index] += xnn_float16_to_float(bias[n_index]);
      |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 errors generated.
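
One plausible reading of this error, assuming the element type of `c_ref` is a storage-only half-precision type with no compound-assignment operators (an assumption; the tester source is not shown here), is that `half += float` simply has no overload on this toolchain. The sketch below uses a hypothetical `Half` wrapper and accumulates the reference result in `float`, converting once at the end. It is only an illustration of that pattern, not the fix that was applied in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a storage-only half type: it converts to and from
// float but deliberately defines no arithmetic, so `Half += float` would fail
// to compile much like the error above. The payload is kept as a float purely
// for brevity; a real half type would store 16 bits.
struct Half {
  float value;
};
inline float half_to_float(Half h) { return h.value; }
inline Half float_to_half(float f) { return Half{f}; }

// Reference GEMM with bias: accumulate in float, convert once at the end.
// input is m x k, weights is n x k (both row-major), bias has n entries.
std::vector<Half> reference_gemm(const std::vector<Half>& input,
                                 const std::vector<Half>& weights,
                                 const std::vector<Half>& bias,
                                 std::size_t m, std::size_t n, std::size_t k) {
  std::vector<float> acc(m * n, 0.0f);
  for (std::size_t mi = 0; mi < m; ++mi) {
    for (std::size_t ni = 0; ni < n; ++ni) {
      for (std::size_t ki = 0; ki < k; ++ki) {
        acc[mi * n + ni] += half_to_float(input[mi * k + ki]) *
                            half_to_float(weights[ni * k + ki]);
      }
      acc[mi * n + ni] += half_to_float(bias[ni]);
    }
  }
  std::vector<Half> out(m * n);
  for (std::size_t i = 0; i < m * n; ++i) out[i] = float_to_half(acc[i]);
  return out;
}
```

Accumulating in `float` also keeps the reference result independent of whether the half type happens to be a native arithmetic type on a given platform.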

@JonathanC-ARM
Contributor Author

JonathanC-ARM commented Oct 21, 2025

Hi @dsharlet, could you give a bit more context on the error, particularly the build command used? The strange thing is that I can't reproduce it on my end.

I am compiling on an M4, though; I'm going to try on an x86_64 machine shortly and cross-compile.

bazel build -c opt --enable_bzlmod --define xnn_enable_arm_sme=true --define xnn_enable_arm_sme2=true //test:gemm_microkernel_tester

I tried a few variations of the command, cleaned my environment, etc. I also synced my fork with master in case anything had changed since.

copybara-service bot pushed a commit that referenced this pull request Oct 21, 2025
--
c69ccdb by Gian Marco Iodice <[email protected]>:

Prototype: Add support for fp16 iGEMM with SME2

- Initial prototype to enable fp16 iGEMM with SME2 in conv2d

Signed-off-by: Gian Marco Iodice <[email protected]>

--
a3537a1 by Gian Marco Iodice <[email protected]>:

Include missing files

Signed-off-by: Gian Marco Iodice <[email protected]>

--
232826c by Gian Marco Iodice <[email protected]>:

Update FP16 iGEMM based on review comments

Signed-off-by: Gian Marco Iodice <[email protected]>

--
03bccaa by Jonathan Clohessy <[email protected]>:

Updated FP16 iGemm Review with Fixes

Signed-off-by: Jonathan Clohessy <[email protected]>

--
9cd6e88 by Jonathan Clohessy <[email protected]>:

Fix rebase issues

Signed-off-by: Jonathan Clohessy <[email protected]>
FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm 69ccf09
PiperOrigin-RevId: 821598958
Aelphy and others added 12 commits October 21, 2025 16:16
@JonathanC-ARM
Contributor Author

@dsharlet thanks for telling me about the build problem; it seemed to show up only on Linux machines. I was able to fix the build issue in the latest commit.

Will be resolving the conflicts with Master shortly.

copybara-service bot pushed a commit that referenced this pull request Oct 22, 2025
--
7eb618d by Misha Gutman <[email protected]>:

Added multiple_of to handle all multiples in reductions simply.

No significant performance loss:

bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1] 1.720µ ± 0%   1.719µ ± 17%       ~ (p=0.485 n=6)
bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1] 1.744µ ± 3%   1.753µ ± 14%       ~ (p=0.310 n=6)
bench/sum_uint8_int32_4x64_avx512bw/real_time [256x1x256x1] 1.218µ ± 1%   1.216µ ± 17%       ~ (p=0.818 n=6)
bench/sum_int8_int32_4x64_avx512bw/real_time  [256x1x256x1] 1.217µ ± 0%   1.216µ ± 15%       ~ (p=0.699 n=6)
bench/sum_fp32_4x16_avx512f/real_time         [256x1x256x1] 2.263µ ± 1%   2.268µ ±  0%       ~ (p=0.394 n=6)
bench/sum_fp32_4x8_avx2/real_time             [256x1x256x1] 4.342µ ± 0%   4.357µ ±  0%       ~ (p=0.065 n=6)
bench/sum_uint8_int32_4x32_avx2/real_time     [256x1x256x1] 2.221µ ± 0%   2.285µ ±  8%       ~ (p=0.065 n=6)
bench/sum_int8_int32_4x32_avx2/real_time      [256x1x256x1] 2.219µ ± 1%   2.279µ ±  2%  +2.70% (p=0.002 n=6)
bench/sum_fp16_fp32_4x16_f16c/real_time       [256x1x256x1] 2.344µ ± 0%   2.345µ ±  7%       ~ (p=0.485 n=6)
bench/sum_uint8_int32_4x16_sse41/real_time    [256x1x256x1] 4.318µ ± 0%   4.328µ ±  0%  +0.22% (p=0.015 n=6)
bench/sum_int8_int32_4x16_sse41/real_time     [256x1x256x1] 4.319µ ± 0%   4.325µ ±  1%       ~ (p=0.394 n=6)
bench/sum_fp32_4x4_sse2/real_time             [256x1x256x1] 8.790µ ± 0%   8.795µ ±  0%       ~ (p=0.394 n=6)
bench/sum_uint8_int32_4x16_sse2/real_time     [256x1x256x1] 3.966µ ± 0%   3.995µ ±  0%  +0.73% (p=0.002 n=6)
bench/sum_int8_int32_4x16_sse2/real_time      [256x1x256x1] 5.382µ ± 1%   5.410µ ±  1%  +0.52% (p=0.041 n=6)
bench/sum_uint8_int32_4x16_ssse3/real_time    [256x1x256x1] 3.977µ ± 0%   3.994µ ±  1%  +0.44% (p=0.004 n=6)
bench/sum_int8_int32_4x16_ssse3/real_time     [256x1x256x1] 5.373µ ± 0%   5.412µ ±  2%  +0.72% (p=0.002 n=6)

PiperOrigin-RevId: 821549068

--
e5cb8c0 by Misha Gutman <[email protected]>:

Changed K1_1 strategy for f32 to go with single accumulator and maximally
long multiple, this significantly improved performance.
Since contiguous case tiles became different from discontiguous changed the
naming to not include tiles information.

bench/sum_fp32_4x16_avx512f/real_time [256x1x256x1] 2.259µ ± 1%
bench/sum_fp32_4x8_avx2/real_time     [256x1x256x1] 4.339µ ± 0%
bench/sum_fp32_4x4_sse2/real_time     [256x1x256x1] 8.787µ ± 1%
bench/sum_fp32/real_time              [256x1x256x1] 3.255µ ± 7%
bench/sum_fp32_avx512f/real_time [256x1x256x1]      1.441µ ± 17%
bench/sum_fp32_avx2/real_time    [256x1x256x1]      1.761µ ± 14%
bench/sum_fp32_sse2/real_time    [256x1x256x1]      3.435µ ± 13%
bench/sum_fp32/real_time         [256x1x256x1]      3.261µ ± 13%

bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1] 1.722µ ± 1%
bench/sum_bf16_fp32_avx512bf16/real_time [256x1x256x1]      1.703µ ± 1%
bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1] 1.749µ ± 0%
bench/sum_fp16_fp32_avx512fp16/real_time [256x1x256x1]      1.744µ ± 0%
bench/sum_fp16_fp32_4x16_f16c/real_time       [256x1x256x1] 2.341µ ± 1%
bench/sum_fp16_fp32_f16c/real_time       [256x1x256x1]      1.652µ ± 7%

PiperOrigin-RevId: 821556723

--
aeeca5d by Dillon Sharlet <[email protected]>:

Remove threadpool library and just build threadpool.cc as part of subgraph

PiperOrigin-RevId: 821566586

--
7304027 by Dillon Sharlet <[email protected]>:

Disable SME when msan is enabled

PiperOrigin-RevId: 821694771

--
89a72e3 by Dillon Sharlet <[email protected]>:

Don't bother disabling KleidiAI if using YNNPACK

This causes builds to fail, and it's harmless to leave it enabled.

PiperOrigin-RevId: 821704594

--
0c5edfc by Dillon Sharlet <[email protected]>:

Disable SME on older Apple compilers

PiperOrigin-RevId: 821708108

--
9b29972 by Dillon Sharlet <[email protected]>:

Fix usage of `sv{ld,st}1_hor_vnum_za32`

According to the ACLE documentation, this increments *both* the slice and the pointer by `vnum` vectors. This usage of it treated it as if it only incremented the pointer to read from/write to by 1 vector (but did not change the slice).

This is interesting because this code worked on QEMU, but fails on real (Apple M4) hardware. I think this indicates there is a bug in the implementation of these instructions in QEMU.

PiperOrigin-RevId: 821730217
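
The semantics described in this commit message can be illustrated with a small scalar model. This is not the ACLE intrinsic and does not use SME at all: `ZaTileModel` and `model_ld1_hor_vnum` are hypothetical names, and the model assumes one "vector" equals one horizontal slice of `dim` 32-bit elements. It only shows that `vnum` offsets both the destination slice and the source pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of a ZA tile of 32-bit elements: `dim` horizontal slices, each
// `dim` elements wide. This is NOT the ACLE intrinsic, only an illustration.
struct ZaTileModel {
  std::size_t dim;
  std::vector<float> data;  // dim * dim elements, row-major
  explicit ZaTileModel(std::size_t d) : dim(d), data(d * d, 0.0f) {}
};

// Models the `_vnum` load as described above: `vnum` is added to BOTH the
// slice index and the source pointer (in units of one vector, i.e. `dim`
// elements in this model). Assumes slice + vnum stays within [0, dim).
void model_ld1_hor_vnum(ZaTileModel& za, std::size_t slice, const float* ptr,
                        std::int64_t vnum) {
  const std::size_t dst_slice = slice + static_cast<std::size_t>(vnum);
  assert(dst_slice < za.dim);
  const float* src = ptr + vnum * static_cast<std::int64_t>(za.dim);
  for (std::size_t i = 0; i < za.dim; ++i) {
    za.data[dst_slice * za.dim + i] = src[i];
  }
}
```

Under this model, a loop that calls `model_ld1_hor_vnum(za, 0, src, v)` for `v = 0, 1, 2, ...` fills consecutive slices from consecutive vectors. A loop that also advances `slice` by `v` would land on slice `2 * v`, which is the kind of mismatch the commit message describes.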

--
0d3dc09 by Dillon Sharlet <[email protected]>:

Fix correctness of dot benchmarks for transpose_a kernels

PiperOrigin-RevId: 821808685

--
4b73eb1 by Pedro Gonnet <[email protected]>:

Update `pthreadpool` dependency.

PiperOrigin-RevId: 821857188

--
66d084b by Dillon Sharlet <[email protected]>:

Fix flaky quantize tests

PiperOrigin-RevId: 821867761

--
6fc5696 by Quentin Khan <[email protected]>:

Add missing `gemm_config` `.element_size` initializations.

PiperOrigin-RevId: 821984759

--
923b7f9 by Jonathan Clohessy <[email protected]>:

Fix build issues and guard against sme2 specific path

Signed-off-by: Jonathan Clohessy <[email protected]>
FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm 56ee7cb
PiperOrigin-RevId: 821598958
@gonnet
Collaborator

gonnet commented Oct 22, 2025

This is still failing to build for the CI workflows, e.g. https://github.com/google/XNNPACK/actions/runs/18713631848/job/53367695764.

copybara-service bot pushed a commit that referenced this pull request Oct 29, 2025
--
06a44d2 by Jonathan Clohessy <[email protected]>:

Refactor Convolution to new structure and fix build failures

Signed-off-by: Jonathan Clohessy <[email protected]>

--
175903d by Jonathan Clohessy <[email protected]>:

Remove unused gemm config structure init

Signed-off-by: Jonathan Clohessy <[email protected]>
FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm 9efa3d6
PiperOrigin-RevId: 821598958
@JonathanC-ARM
Contributor Author

Hi @dsharlet, I made some additional changes and ran all of //test/... with sme2 both on and off. Everything passed; I was able to replicate the original failures and work through them, so I think it should be all good now.

Thanks

Signed-off-by: Jonathan Clohessy <[email protected]>
@dsharlet
Collaborator

There were some issues with build timeouts earlier. I re-ran the failed builds, and there is a remaining real build issue:

C:\Users\runneradmin\.cargo\bin\ccache.exe C:\PROGRA~1\MICROS~2\2022\ENTERP~1\VC\Tools\MSVC\1444~1.352\bin\Hostx64\arm64\cl.exe  /nologo /TP -DNOMINMAX -DPTHREADPOOL_NO_DEPRECATED_API=1 -DXNN_ENABLE_ARM_BF16=0 -DXNN_ENABLE_ARM_DOTPROD=1 -DXNN_ENABLE_ARM_FP16_SCALAR=0 -DXNN_ENABLE_ARM_FP16_VECTOR=1 -DXNN_ENABLE_ARM_I8MM=1 -DXNN_ENABLE_ARM_SME2=0 -DXNN_ENABLE_ARM_SME=1 -DXNN_ENABLE_ASSEMBLY=0 -DXNN_ENABLE_AVX256SKX=1 -DXNN_ENABLE_AVX256VNNI=1 -DXNN_ENABLE_AVX256VNNIGFNI=1 -DXNN_ENABLE_AVX2=1 -DXNN_ENABLE_AVX512AMX=1 -DXNN_ENABLE_AVX512BF16=0 -DXNN_ENABLE_AVX512F=1 -DXNN_ENABLE_AVX512FP16=0 -DXNN_ENABLE_AVX512SKX=1 -DXNN_ENABLE_AVX512VBMI=1 -DXNN_ENABLE_AVX512VNNI=1 -DXNN_ENABLE_AVX512VNNIGFNI=1 -DXNN_ENABLE_AVX=1 -DXNN_ENABLE_AVXVNNI=1 -DXNN_ENABLE_AVXVNNIINT8=0 -DXNN_ENABLE_CPUINFO=1 -DXNN_ENABLE_F16C=1 -DXNN_ENABLE_FMA3=1 -DXNN_ENABLE_HVX=1 -DXNN_ENABLE_KLEIDIAI=0 -DXNN_ENABLE_RISCV_VECTOR=1 -DXNN_ENABLE_SPARSE=1 -DXNN_ENABLE_SSE2=1 -DXNN_ENABLE_SSE41=1 -DXNN_ENABLE_SSE=1 -DXNN_ENABLE_SSSE3=1 -DXNN_ENABLE_VSX=1 -DXNN_ENABLE_WASM_REVECTORIZE=0 -DXNN_LOG_LEVEL=0 -IC:\a\XNNPACK\XNNPACK\include -IC:\a\XNNPACK\XNNPACK\build\windows\arm64\pthreadpool-source\include -external:IC:\a\XNNPACK\XNNPACK\. -external:IC:\a\XNNPACK\XNNPACK\build\windows\arm64\googletest-source\googlemock\include -external:IC:\a\XNNPACK\XNNPACK\build\windows\arm64\googletest-source\googlemock -external:IC:\a\XNNPACK\XNNPACK\build\windows\arm64\googletest-source\googletest\include -external:IC:\a\XNNPACK\XNNPACK\build\windows\arm64\googletest-source\googletest -external:W0 /UNDEBUG  /DWIN32 /D_WINDOWS /GR /EHsc /O2 /Ob2 /DNDEBUG -std:c++14 -MD /wd4146 /bigobj /wd4190 /O2 /DEBUG:FASTLINK /Zi /showIncludes /Fotest\CMakeFiles\gemm-microkernel-tester.dir\gemm-microkernel-tester.cc.obj /Fdtest\CMakeFiles\gemm-microkernel-tester.dir\gemm-microkernel-tester.pdb /FS -c C:\a\XNNPACK\XNNPACK\test\gemm-microkernel-tester.cc
C:\a\XNNPACK\XNNPACK\test\gemm-microkernel-tester.cc(2864): error C3861: 'xnn_packed_size_kai_f16_conv_goki_w': identifier not found
C:\a\XNNPACK\XNNPACK\test\gemm-microkernel-tester.cc(2870): error C3861: 'xnn_pack_kai_f16_conv_goki_w_sme': identifier not found

@JonathanC-ARM
Contributor Author

I just made some small tweaks to the `#ifdef` guards; without them, this code was getting into non-KleidiAI builds. I was able to see the failure with the following command:

bazel test --compilation_mode=opt --define xnn_enable_assembly=false --define xnn_enable_arm_fp16_scalar=false --define xnn_enable_arm_bf16=false --define xnn_enable_kleidiai=false //test/...

I resolved it, and from what I was able to test on my end it should be working now.
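
As a rough illustration of the kind of guard being discussed, the sketch below keeps a KleidiAI-only declaration behind the `XNN_ENABLE_KLEIDIAI` macro that appears in the failing compile command. The helper names `example_kai_packed_size` and `example_kai_path_compiled` are hypothetical and are not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>

// Default the feature macro to "off" when the build system does not define it.
#ifndef XNN_ENABLE_KLEIDIAI
#define XNN_ENABLE_KLEIDIAI 0
#endif

#if XNN_ENABLE_KLEIDIAI
// Hypothetical KleidiAI-backed helper; only declared when the feature is on.
std::size_t example_kai_packed_size(std::size_t n, std::size_t k);
#endif

// Hypothetical test helper: reports whether the KleidiAI-only path is compiled
// in, so callers never name a KleidiAI symbol in a build that lacks it.
inline bool example_kai_path_compiled() {
#if XNN_ENABLE_KLEIDIAI
  return true;
#else
  return false;
#endif
}
```

Test code that references KleidiAI-only symbols would sit inside the same `#if` block, so a build with `-DXNN_ENABLE_KLEIDIAI=0` never sees those identifiers.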

copybara-service bot pushed a commit that referenced this pull request Oct 31, 2025
--
999f4e3 by Jonathan Clohessy <[email protected]>:

Updated code with sme variants of kernels and fixed tests

Signed-off-by: Jonathan Clohessy <[email protected]>

--
a2bd7aa by Jonathan Clohessy <[email protected]>:

Updated ifdef guards and yml file

Signed-off-by: Jonathan Clohessy <[email protected]>
FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm a2bd7aa
PiperOrigin-RevId: 821598958
@JonathanC-ARM JonathanC-ARM reopened this Nov 2, 2025
@dsharlet
Collaborator

dsharlet commented Nov 2, 2025

This is crashing in some of our tests. It seems like the symptom is relatively simple, this function pointer call:

context->ukernel.function[XNN_UARCH_DEFAULT](

Is trying to call this function: https://github.com/google/XNNPACK/pull/9005/files#diff-017fa5d842d8909aebb30be2f3f22e7c785374b4c7e1b26ac6ad1a5853efc794R40-R43

The arguments don't match, and one of the mismatched arguments is params, which is being passed context->cn_stride, hence the crash.

I suspect the problem is that we are missing an LH packing config here?

case xnn_operator_type_convolution_nhwc_pqs8_qs8_qc8w:
  if (inline_lhs_packing) {
    packed_lh_config = xnn_init_x8_igemm_pack_lh_config();
  }
  break;

The presence of that seems like it will affect which operator-run code will execute.

Aside from the problem itself, I'm concerned that our operator or subgraph tests didn't catch this. We need to make sure we have test coverage from the subgraph API for this before we merge it.
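
To make the suspected failure mode concrete, here is a heavily simplified model of the selection logic, assuming the hypothesis above is right. None of these names are XNNPACK's: `OpType`, `PackLhConfig`, `RunPath` and both functions are hypothetical, and the `pf16_case_present` flag only simulates whether the switch has a case for the new operator type.

```cpp
#include <cassert>

// Hypothetical, heavily simplified model of the selection logic discussed
// above; none of these names are XNNPACK's.
enum class OpType { kPqs8Conv, kPf16Conv };

struct PackLhConfig {
  int id;
};

enum class RunPath { kPlainIgemm, kInlinePackedIgemm };

// Returns the LH packing config for an operator type, or nullptr if the
// switch has no case for it (the suspected omission).
const PackLhConfig* select_pack_lh_config(OpType type, bool inline_lhs_packing,
                                          bool pf16_case_present) {
  static const PackLhConfig x8{8};
  static const PackLhConfig x16{16};
  switch (type) {
    case OpType::kPqs8Conv:
      if (inline_lhs_packing) return &x8;
      break;
    case OpType::kPf16Conv:
      if (inline_lhs_packing && pf16_case_present) return &x16;
      break;
  }
  return nullptr;
}

// The run path depends on whether a config was found.
RunPath select_run_path(const PackLhConfig* config) {
  return config != nullptr ? RunPath::kInlinePackedIgemm : RunPath::kPlainIgemm;
}
```

In this model, a missing case silently sends the operator down the plain path. If the kernel behind the function pointer expects the inline-packed calling convention, that path would call it with the wrong arguments, which matches the symptom described.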

copybara-service bot pushed a commit that referenced this pull request Nov 2, 2025
I'm investigating a bug from #9005, and discovered that many of these codepaths are rarely or never exercised by our tests, because there are just too many of them (and some of them are simply dead code). The probability of groups = 1 and batches = 1 in a randomized test is low, and when running on slow emulators, we don't get many chances. We also have different code for ARM vs. not-ARM (via `XNN_MAX_UARCH_TYPES`), which again forks our test coverage.

I don't think these specializations are worth the cost (the constant vigilance required to ensure we don't lose test coverage of all of these paths).

PiperOrigin-RevId: 826767227
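
The coverage argument can be put in numbers with a small sketch. It assumes, purely for illustration, that a randomized test draws `groups` and `batches` independently and uniformly from `[1, max_groups]` and `[1, max_batches]`; the actual distributions used by the tests are not stated here, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Probability that a single randomized test draws groups == 1 and
// batches == 1, assuming (hypothetically) that each is drawn independently
// and uniformly from [1, max_groups] and [1, max_batches].
double single_draw_probability(int max_groups, int max_batches) {
  return 1.0 / (static_cast<double>(max_groups) * max_batches);
}

// Probability that at least one of `trials` independent draws hits the case.
double hit_probability(double p, int trials) {
  return 1.0 - std::pow(1.0 - p, trials);
}
```

For example, with both ranges set to 4 the single-draw probability is 1/16, and ten trials on a slow emulator would hit the specialized path with probability of roughly 0.48, so about half of such runs would never exercise it.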
@dsharlet
Collaborator

dsharlet commented Nov 2, 2025

I sent #9005 which attempts to reduce the number of codepaths that are relevant here. When I was trying to investigate the crash, I found that there are many different codepaths that could be used, and we only test a few of them. However, I don't believe that addresses the bug in this case.

copybara-service bot pushed a commit that referenced this pull request Nov 3, 2025
PiperOrigin-RevId: 826767227
@JonathanC-ARM
Contributor Author

Hi @dsharlet, I've made some changes and added a new test case that exercises the failing code path. The test currently passes, but if you comment out the assignment below in src/operators/convolution-nhwc.c (around line 2200):

    case xnn_operator_type_convolution_nhwc_pf16:
      if (inline_lhs_packing) {
        // packed_lh_config = xnn_init_x16_igemm_pack_lh_config();
      }
      break;

the test segfaults for the reason you described previously. With this update, all tests that I am aware of should pass.
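As a rough illustration of how a skipped assignment like this can end in a segmentation fault, here is a minimal sketch. It assumes the config pointer simply stays null and is dereferenced later; the thread does not spell out the exact mechanism. Every type and function here (`OperatorType`, `PackLhConfig`, `SelectPackLhConfig`, `ComputePackedSize`) is a hypothetical stand-in, not XNNPACK's real definition.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins; XNNPACK's real types differ.
enum class OperatorType { kConvolutionNhwcF32, kConvolutionNhwcPf16 };

struct PackLhConfig {
  size_t (*size_fn)(size_t m, size_t k);
};

// Packed size in bytes for 16-bit elements (illustrative only).
size_t PackedSizeX16(size_t m, size_t k) { return m * k * sizeof(uint16_t); }

// Plays the role of the initializer that the comment disables.
const PackLhConfig* InitX16IgemmPackLhConfig() {
  static const PackLhConfig config = {&PackedSizeX16};
  return &config;
}

// Mirrors the shape of the switch above: the pointer is only set when
// inline LHS packing is requested for the pf16 operator.
const PackLhConfig* SelectPackLhConfig(OperatorType type,
                                       bool inline_lhs_packing) {
  const PackLhConfig* packed_lh_config = nullptr;
  switch (type) {
    case OperatorType::kConvolutionNhwcPf16:
      if (inline_lhs_packing) {
        packed_lh_config = InitX16IgemmPackLhConfig();
      }
      break;
    default:
      break;
  }
  return packed_lh_config;
}

// Reports failure instead of dereferencing a null config. Calling
// config->size_fn without this check is the kind of access that would crash.
bool ComputePackedSize(const PackLhConfig* config, size_t m, size_t k,
                       size_t* out) {
  if (config == nullptr || config->size_fn == nullptr) return false;
  *out = config->size_fn(m, k);
  return true;
}
```

A regression test in this style would assert that the pf16 path with inline LHS packing yields a non-null config, so that the failure shows up as a test assertion rather than a crash.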


xnn_subgraph* Subgraph() const { return subgraph_.get(); }

// Utility to help force inline LHS packing for the last convolution node (pf16)
Collaborator

AddConvolution2D has a flags parameter, can you just use that instead of adding this helper?

Collaborator

Also: adding an include of operator.h in this file is breaking the bazel build. Rather than fixing that, I'd rather just revert the changes in this file and use the flags parameter.

Contributor Author

JonathanC-ARM Nov 18, 2025

Hi @dsharlet, I dropped this change and set the flags in the test instead. Apologies for breaking the build; oddly, this built without problems under Bazel on my M4, so I was unaware of it. I tested the latest change on both Ubuntu and M4, so it should be good now.
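The design choice discussed in this thread, passing a flag through an existing parameter rather than adding a helper that reaches into operator internals, can be sketched as follows. This is only an illustration under assumptions. `FakeSubgraphTester`, `kFlagInlineLhsPacking`, and the simplified `AddConvolution2D(flags)` signature are hypothetical and do not reproduce the real test utility or the real flag value.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical flag bit; the real flag name and value are not shown here.
constexpr uint32_t kFlagInlineLhsPacking = 1u << 0;

// Hypothetical record of one convolution node and the flags it was built with.
struct ConvNode {
  uint32_t flags;
};

// Hypothetical, heavily simplified test builder. The only point is that the
// caller chooses flags per node, so no extra helper or internal header is
// needed to request inline LHS packing.
class FakeSubgraphTester {
 public:
  FakeSubgraphTester& AddConvolution2D(uint32_t flags = 0) {
    nodes_.push_back(ConvNode{flags});
    return *this;
  }

  const std::vector<ConvNode>& nodes() const { return nodes_; }

 private:
  std::vector<ConvNode> nodes_;
};

// True when the node was created with the inline LHS packing flag set.
bool UsesInlineLhsPacking(const ConvNode& node) {
  return (node.flags & kFlagInlineLhsPacking) != 0;
}
```

In a test written this way, only the last convolution would be added with the flag set, which keeps the test file free of any dependency on the operator header that broke the Bazel build.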

copybara-service bot pushed a commit that referenced this pull request Nov 17, 2025
--
c69ccdb by Gian Marco Iodice <[email protected]>:

Prototype: Add support for fp16 iGEMM with SME2

- Initial prototype to enable fp16 iGEMM with SME2 in conv2d

Signed-off-by: Gian Marco Iodice <[email protected]>

--
a3537a1 by Gian Marco Iodice <[email protected]>:

Include missing files

Signed-off-by: Gian Marco Iodice <[email protected]>

--
232826c by Gian Marco Iodice <[email protected]>:

Update FP16 iGEMM based on review comments

Signed-off-by: Gian Marco Iodice <[email protected]>

--
03bccaa by Jonathan Clohessy <[email protected]>:

Updated FP16 iGemm Review with Fixes

Signed-off-by: Jonathan Clohessy <[email protected]>

--
9cd6e88 by Jonathan Clohessy <[email protected]>:

Fix rebase issues

Signed-off-by: Jonathan Clohessy <[email protected]>

--
7eb618d by Misha Gutman <[email protected]>:

Added multiple_of to handle all multiples in reductions simply.

No significant performance loss:

bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1] 1.720µ ± 0%   1.719µ ± 17%       ~ (p=0.485 n=6)
bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1] 1.744µ ± 3%   1.753µ ± 14%       ~ (p=0.310 n=6)
bench/sum_uint8_int32_4x64_avx512bw/real_time [256x1x256x1] 1.218µ ± 1%   1.216µ ± 17%       ~ (p=0.818 n=6)
bench/sum_int8_int32_4x64_avx512bw/real_time  [256x1x256x1] 1.217µ ± 0%   1.216µ ± 15%       ~ (p=0.699 n=6)
bench/sum_fp32_4x16_avx512f/real_time         [256x1x256x1] 2.263µ ± 1%   2.268µ ±  0%       ~ (p=0.394 n=6)
bench/sum_fp32_4x8_avx2/real_time             [256x1x256x1] 4.342µ ± 0%   4.357µ ±  0%       ~ (p=0.065 n=6)
bench/sum_uint8_int32_4x32_avx2/real_time     [256x1x256x1] 2.221µ ± 0%   2.285µ ±  8%       ~ (p=0.065 n=6)
bench/sum_int8_int32_4x32_avx2/real_time      [256x1x256x1] 2.219µ ± 1%   2.279µ ±  2%  +2.70% (p=0.002 n=6)
bench/sum_fp16_fp32_4x16_f16c/real_time       [256x1x256x1] 2.344µ ± 0%   2.345µ ±  7%       ~ (p=0.485 n=6)
bench/sum_uint8_int32_4x16_sse41/real_time    [256x1x256x1] 4.318µ ± 0%   4.328µ ±  0%  +0.22% (p=0.015 n=6)
bench/sum_int8_int32_4x16_sse41/real_time     [256x1x256x1] 4.319µ ± 0%   4.325µ ±  1%       ~ (p=0.394 n=6)
bench/sum_fp32_4x4_sse2/real_time             [256x1x256x1] 8.790µ ± 0%   8.795µ ±  0%       ~ (p=0.394 n=6)
bench/sum_uint8_int32_4x16_sse2/real_time     [256x1x256x1] 3.966µ ± 0%   3.995µ ±  0%  +0.73% (p=0.002 n=6)
bench/sum_int8_int32_4x16_sse2/real_time      [256x1x256x1] 5.382µ ± 1%   5.410µ ±  1%  +0.52% (p=0.041 n=6)
bench/sum_uint8_int32_4x16_ssse3/real_time    [256x1x256x1] 3.977µ ± 0%   3.994µ ±  1%  +0.44% (p=0.004 n=6)
bench/sum_int8_int32_4x16_ssse3/real_time     [256x1x256x1] 5.373µ ± 0%   5.412µ ±  2%  +0.72% (p=0.002 n=6)

PiperOrigin-RevId: 821549068

--
e5cb8c0 by Misha Gutman <[email protected]>:

Changed K1_1 strategy for f32 to go with single accumulator and maximally
long multiple, this significantly improved performance.
Since contiguous case tiles became different from discontiguous changed the
naming to not include tiles information.

bench/sum_fp32_4x16_avx512f/real_time [256x1x256x1] 2.259µ ±  1%
bench/sum_fp32_4x8_avx2/real_time     [256x1x256x1] 4.339µ ±  0%
bench/sum_fp32_4x4_sse2/real_time     [256x1x256x1] 8.787µ ±  1%
bench/sum_fp32/real_time              [256x1x256x1] 3.255µ ±  7%
bench/sum_fp32_avx512f/real_time      [256x1x256x1] 1.441µ ± 17%
bench/sum_fp32_avx2/real_time         [256x1x256x1] 1.761µ ± 14%
bench/sum_fp32_sse2/real_time         [256x1x256x1] 3.435µ ± 13%
bench/sum_fp32/real_time              [256x1x256x1] 3.261µ ± 13%

bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1] 1.722µ ± 1%
bench/sum_bf16_fp32_avx512bf16/real_time      [256x1x256x1] 1.703µ ± 1%
bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1] 1.749µ ± 0%
bench/sum_fp16_fp32_avx512fp16/real_time      [256x1x256x1] 1.744µ ± 0%
bench/sum_fp16_fp32_4x16_f16c/real_time       [256x1x256x1] 2.341µ ± 1%
bench/sum_fp16_fp32_f16c/real_time            [256x1x256x1] 1.652µ ± 7%

PiperOrigin-RevId: 821556723

--
aeeca5d by Dillon Sharlet <[email protected]>:

Remove threadpool library and just build threadpool.cc as part of subgraph

PiperOrigin-RevId: 821566586

--
7304027 by Dillon Sharlet <[email protected]>:

Disable SME when msan is enabled

PiperOrigin-RevId: 821694771

--
89a72e3 by Dillon Sharlet <[email protected]>:

Don't bother disabling KleidiAI if using YNNPACK

This causes builds to fail, and it's harmless to leave it enabled.

PiperOrigin-RevId: 821704594

--
0c5edfc by Dillon Sharlet <[email protected]>:

Disable SME on older Apple compilers

PiperOrigin-RevId: 821708108

--
9b29972 by Dillon Sharlet <[email protected]>:

Fix usage of `sv{ld,st}1_hor_vnum_za32`

According to the ACLE documentation, this increments *both* the slice and the pointer by `vnum` vectors. This usage of it treated it as if it only incremented the pointer to read from/write to by 1 vector (but did not change the slice).

This is interesting because this code worked on QEMU, but fails on real (Apple M4) hardware. I think this indicates there is a bug in the implementation of these instructions in QEMU.

PiperOrigin-RevId: 821730217
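
The difference between the two readings of `vnum` described in this commit message can be shown with a scalar model. The sketch below does not use the real SME intrinsics. `ZaTileModel`, `LoadHorVnum`, and `LoadHorPointerOnly` are hypothetical names, the tile is a plain 2D array of 32-bit elements, and `vnum` is assumed non-negative with `slice + vnum` in range.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of a ZA tile: `rows` horizontal slices of `cols` elements.
struct ZaTileModel {
  size_t rows;
  size_t cols;
  std::vector<uint32_t> data;

  ZaTileModel(size_t r, size_t c) : rows(r), cols(c), data(r * c, 0) {}

  uint32_t& at(size_t r, size_t c) { return data[r * cols + c]; }
};

// Models the behaviour the commit message attributes to the ACLE docs:
// both the slice index and the pointer advance by `vnum` vectors.
// Assumes slice + vnum < za.rows.
void LoadHorVnum(ZaTileModel& za, size_t slice, const uint32_t* ptr,
                 size_t vnum) {
  const size_t target_slice = slice + vnum;
  const uint32_t* src = ptr + vnum * za.cols;
  for (size_t c = 0; c < za.cols; ++c) {
    za.at(target_slice, c) = src[c];
  }
}

// Models the mistaken reading: only the pointer advances and the slice
// index is left unchanged.
void LoadHorPointerOnly(ZaTileModel& za, size_t slice, const uint32_t* ptr,
                        size_t vnum) {
  const uint32_t* src = ptr + vnum * za.cols;
  for (size_t c = 0; c < za.cols; ++c) {
    za.at(slice, c) = src[c];
  }
}
```

Calling both models with slice 0 and `vnum` 2 on a 4x4 tile reads the same memory, but the first writes slice 2 while the second writes slice 0. Code written against the second reading therefore ends up with data in the wrong slices once the first behaviour applies.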

--
0d3dc09 by Dillon Sharlet <[email protected]>:

Fix correctness of dot benchmarks for transpose_a kernels

PiperOrigin-RevId: 821808685

--
4b73eb1 by Pedro Gonnet <[email protected]>:

Update `pthreadpool` dependency.

PiperOrigin-RevId: 821857188

--
66d084b by Dillon Sharlet <[email protected]>:

Fix flaky quantize tests

PiperOrigin-RevId: 821867761

--
6fc5696 by Quentin Khan <[email protected]>:

Add missing `gemm_config` `.element_size` initializations.

PiperOrigin-RevId: 821984759

--
923b7f9 by Jonathan Clohessy <[email protected]>:

Fix build issues and guard against sme2 specific path

Signed-off-by: Jonathan Clohessy <[email protected]>

--
06a44d2 by Jonathan Clohessy <[email protected]>:

Refactor Convolution to new structure and fix build failures

Signed-off-by: Jonathan Clohessy <[email protected]>

--
175903d by Jonathan Clohessy <[email protected]>:

Remove unused gemm config structure init

Signed-off-by: Jonathan Clohessy <[email protected]>

--
999f4e3 by Jonathan Clohessy <[email protected]>:

Updated code with sme variants of kernels and fixed tests

Signed-off-by: Jonathan Clohessy <[email protected]>

--
a2bd7aa by Jonathan Clohessy <[email protected]>:

Updated ifdef guards and yml file

Signed-off-by: Jonathan Clohessy <[email protected]>

--
551cfde by Jonathan Clohessy <[email protected]>:

Add new test case and fix issue with LHS pack

Signed-off-by: Jonathan Clohessy <[email protected]>
FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm f62aea6
PiperOrigin-RevId: 833326167
copybara-service bot pushed a commit that referenced this pull request Nov 18, 2025
--
c69ccdb by Gian Marco Iodice <[email protected]>:

Prototype: Add support for fp16 iGEMM with SME2

- Initial prototype to enable fp16 iGEMM with SME2 in conv2d

Signed-off-by: Gian Marco Iodice <[email protected]>

--
a3537a1 by Gian Marco Iodice <[email protected]>:

Include missing files

Signed-off-by: Gian Marco Iodice <[email protected]>

--
232826c by Gian Marco Iodice <[email protected]>:

Update FP16 iGEMM based on review comments

Signed-off-by: Gian Marco Iodice <[email protected]>

--
03bccaa by Jonathan Clohessy <[email protected]>:

Updated FP16 iGemm Review with Fixes

Signed-off-by: Jonathan Clohessy <[email protected]>

--
9cd6e88 by Jonathan Clohessy <[email protected]>:

Fix rebase issues

Signed-off-by: Jonathan Clohessy <[email protected]>

--
7eb618d by Misha Gutman <[email protected]>:

Added multiple_of to handle all multiples in reductions simply.

No significant performance loss:

bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1] 1.720µ ± 0%   1.719µ ± 17%       ~ (p=0.485 n=6)
bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1] 1.744µ ± 3%   1.753µ ± 14%       ~ (p=0.310 n=6)
bench/sum_uint8_int32_4x64_avx512bw/real_time [256x1x256x1] 1.218µ ± 1%   1.216µ ± 17%       ~ (p=0.818 n=6)
bench/sum_int8_int32_4x64_avx512bw/real_time  [256x1x256x1] 1.217µ ± 0%   1.216µ ± 15%       ~ (p=0.699 n=6)
bench/sum_fp32_4x16_avx512f/real_time         [256x1x256x1] 2.263µ ± 1%   2.268µ ±  0%       ~ (p=0.394 n=6)
bench/sum_fp32_4x8_avx2/real_time             [256x1x256x1] 4.342µ ± 0%   4.357µ ±  0%       ~ (p=0.065 n=6)
bench/sum_uint8_int32_4x32_avx2/real_time     [256x1x256x1] 2.221µ ± 0%   2.285µ ±  8%       ~ (p=0.065 n=6)
bench/sum_int8_int32_4x32_avx2/real_time      [256x1x256x1] 2.219µ ± 1%   2.279µ ±  2%  +2.70% (p=0.002 n=6)
bench/sum_fp16_fp32_4x16_f16c/real_time       [256x1x256x1] 2.344µ ± 0%   2.345µ ±  7%       ~ (p=0.485 n=6)
bench/sum_uint8_int32_4x16_sse41/real_time    [256x1x256x1] 4.318µ ± 0%   4.328µ ±  0%  +0.22% (p=0.015 n=6)
bench/sum_int8_int32_4x16_sse41/real_time     [256x1x256x1] 4.319µ ± 0%   4.325µ ±  1%       ~ (p=0.394 n=6)
bench/sum_fp32_4x4_sse2/real_time             [256x1x256x1] 8.790µ ± 0%   8.795µ ±  0%       ~ (p=0.394 n=6)
bench/sum_uint8_int32_4x16_sse2/real_time     [256x1x256x1] 3.966µ ± 0%   3.995µ ±  0%  +0.73% (p=0.002 n=6)
bench/sum_int8_int32_4x16_sse2/real_time      [256x1x256x1] 5.382µ ± 1%   5.410µ ±  1%  +0.52% (p=0.041 n=6)
bench/sum_uint8_int32_4x16_ssse3/real_time    [256x1x256x1] 3.977µ ± 0%   3.994µ ±  1%  +0.44% (p=0.004 n=6)
bench/sum_int8_int32_4x16_ssse3/real_time     [256x1x256x1] 5.373µ ± 0%   5.412µ ±  2%  +0.72% (p=0.002 n=6)

PiperOrigin-RevId: 821549068

--
e5cb8c0 by Misha Gutman <[email protected]>:

Changed K1_1 strategy for f32 to go with single accumulator and maximally
long multiple, this significantly improved performance.
Since contiguous case tiles became different from discontiguous changed the
naming to not include tiles information.

bench/sum_fp32_4x16_avx512f/real_time [256x1x256x1] 2.259µ ±  1%
bench/sum_fp32_4x8_avx2/real_time     [256x1x256x1] 4.339µ ±  0%
bench/sum_fp32_4x4_sse2/real_time     [256x1x256x1] 8.787µ ±  1%
bench/sum_fp32/real_time              [256x1x256x1] 3.255µ ±  7%
bench/sum_fp32_avx512f/real_time      [256x1x256x1] 1.441µ ± 17%
bench/sum_fp32_avx2/real_time         [256x1x256x1] 1.761µ ± 14%
bench/sum_fp32_sse2/real_time         [256x1x256x1] 3.435µ ± 13%
bench/sum_fp32/real_time              [256x1x256x1] 3.261µ ± 13%

bench/sum_bf16_fp32_4x32_avx512bf16/real_time [256x1x256x1] 1.722µ ± 1%
bench/sum_bf16_fp32_avx512bf16/real_time      [256x1x256x1] 1.703µ ± 1%
bench/sum_fp16_fp32_4x32_avx512fp16/real_time [256x1x256x1] 1.749µ ± 0%
bench/sum_fp16_fp32_avx512fp16/real_time      [256x1x256x1] 1.744µ ± 0%
bench/sum_fp16_fp32_4x16_f16c/real_time       [256x1x256x1] 2.341µ ± 1%
bench/sum_fp16_fp32_f16c/real_time            [256x1x256x1] 1.652µ ± 7%

PiperOrigin-RevId: 821556723

--
aeeca5d by Dillon Sharlet <[email protected]>:

Remove threadpool library and just build threadpool.cc as part of subgraph

PiperOrigin-RevId: 821566586

--
7304027 by Dillon Sharlet <[email protected]>:

Disable SME when msan is enabled

PiperOrigin-RevId: 821694771

--
89a72e3 by Dillon Sharlet <[email protected]>:

Don't bother disabling KleidiAI if using YNNPACK

This causes builds to fail, and it's harmless to leave it enabled.

PiperOrigin-RevId: 821704594

--
0c5edfc by Dillon Sharlet <[email protected]>:

Disable SME on older Apple compilers

PiperOrigin-RevId: 821708108

--
9b29972 by Dillon Sharlet <[email protected]>:

Fix usage of `sv{ld,st}1_hor_vnum_za32`

According to the ACLE documentation, this increments *both* the slice and the pointer by `vnum` vectors. This usage of it treated it as if it only incremented the pointer to read from/write to by 1 vector (but did not change the slice).

This is interesting because this code worked on QEMU, but fails on real (Apple M4) hardware. I think this indicates there is a bug in the implementation of these instructions in QEMU.

PiperOrigin-RevId: 821730217

--
0d3dc09 by Dillon Sharlet <[email protected]>:

Fix correctness of dot benchmarks for transpose_a kernels

PiperOrigin-RevId: 821808685

--
4b73eb1 by Pedro Gonnet <[email protected]>:

Update `pthreadpool` dependency.

PiperOrigin-RevId: 821857188

--
66d084b by Dillon Sharlet <[email protected]>:

Fix flaky quantize tests

PiperOrigin-RevId: 821867761

--
6fc5696 by Quentin Khan <[email protected]>:

Add missing `gemm_config` `.element_size` initializations.

PiperOrigin-RevId: 821984759

--
923b7f9 by Jonathan Clohessy <[email protected]>:

Fix build issues and guard against sme2 specific path

Signed-off-by: Jonathan Clohessy <[email protected]>

--
06a44d2 by Jonathan Clohessy <[email protected]>:

Refactor Convolution to new structure and fix build failures

Signed-off-by: Jonathan Clohessy <[email protected]>

--
175903d by Jonathan Clohessy <[email protected]>:

Remove unused gemm config structure init

Signed-off-by: Jonathan Clohessy <[email protected]>

--
999f4e3 by Jonathan Clohessy <[email protected]>:

Updated code with sme variants of kernels and fixed tests

Signed-off-by: Jonathan Clohessy <[email protected]>

--
a2bd7aa by Jonathan Clohessy <[email protected]>:

Updated ifdef guards and yml file

Signed-off-by: Jonathan Clohessy <[email protected]>

--
551cfde by Jonathan Clohessy <[email protected]>:

Add new test case and fix issue with LHS pack

Signed-off-by: Jonathan Clohessy <[email protected]>

--
bcc62a0 by Jonathan Clohessy <[email protected]>:

Removed ForceInlineLhsPackingPf16OnLastConv and use runtime flags instead

Signed-off-by: Jonathan Clohessy <[email protected]>
FUTURE_COPYBARA_INTEGRATE_REVIEW=#9005 from JonathanC-ARM:f16_igemm bcc62a0
PiperOrigin-RevId: 833326167
@copybara-service copybara-service bot merged commit 6dbb696 into google:master Nov 18, 2025
23 checks passed

7 participants